Normal Distribution (norm)#

The normal (Gaussian) distribution is the canonical model for additive noise and aggregated effects.

It appears throughout statistics and machine learning via the Central Limit Theorem, as the distribution of measurement errors, and as the maximum-entropy distribution under mean/variance constraints.

Notebook roadmap#

  1. Title & classification

  2. Intuition & motivation

  3. Formal definition (PDF/CDF)

  4. Moments & properties

  5. Parameter interpretation

  6. Derivations (\(\mathbb{E}[X]\), \(\mathrm{Var}(X)\), likelihood)

  7. Sampling & simulation (NumPy-only)

  8. Visualization (PDF, CDF, Monte Carlo)

  9. SciPy integration (scipy.stats.norm)

  10. Statistical use cases

  11. Pitfalls

  12. Summary

import math
import os

import numpy as np
import scipy
from scipy import special, stats

import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

SEED = 7
rng = np.random.default_rng(SEED)

np.set_printoptions(precision=4, suppress=True)

print("numpy ", np.__version__)
print("scipy ", scipy.__version__)
print("plotly", plotly.__version__)
numpy  1.26.2
scipy  1.15.0
plotly 6.5.2

Prerequisites & notation#

Prerequisites

  • comfort with basic calculus (integration by parts)

  • basic probability (PDF/CDF, expectation, likelihood)

Notation

  • \(X \sim \mathcal{N}(\mu, \sigma^2)\) means: mean \(\mu\in\mathbb{R}\), standard deviation \(\sigma>0\).

  • \(Z \sim \mathcal{N}(0,1)\) denotes the standard normal.

  • \(\varphi\) and \(\Phi\) denote the standard normal PDF and CDF.

SciPy uses a location–scale parameterization: stats.norm(loc=μ, scale=σ).

1) Title & classification#

  • Name: norm (Normal / Gaussian distribution)

  • Type: continuous

  • Support: \(x \in (-\infty, \infty)\)

  • Parameter space:

    • location (mean): \(\mu \in \mathbb{R}\)

    • scale (std dev): \(\sigma \in (0, \infty)\)

Equivalent parameterizations you’ll also see:

  • variance \(\sigma^2 > 0\)

  • precision \(\tau = 1/\sigma^2 > 0\)

2) Intuition & motivation#

What it models#

The normal distribution often models the sum of many small, independent effects. A classic mental model is measurement error:

\(\text{observed} = \text{true signal} + \text{noise}\), where the noise is approximately Gaussian.

Two key reasons it shows up so often:

  1. Central Limit Theorem (CLT): standardized sums of many weakly dependent variables tend toward a normal distribution.

  2. Maximum entropy: among all continuous distributions with a fixed mean and variance, the normal has the largest differential entropy (it is the “least informative” choice under those constraints).

Typical real-world use cases#

  • Sensors & experiments: additive noise in physical measurements

  • Averages/aggregates: sampling distributions of means (often approximately normal)

  • Error models: regression residuals, Kalman filters, Gaussian processes

  • Latent-variable models: Gaussian priors and Gaussian likelihoods (conjugacy)

Relations to other distributions#

  • Standardization: if \(X \sim \mathcal{N}(\mu,\sigma^2)\), then \((X-\mu)/\sigma \sim \mathcal{N}(0,1)\).

  • Chi-square: if \(Z \sim \mathcal{N}(0,1)\), then \(Z^2 \sim \chi^2_1\).

  • Additivity: sums of independent normals are normal (means/variances add).

  • Student-\(t\): arises as a standard normal divided by the square root of an independent chi-square over its degrees of freedom.

  • Lognormal: if \(Y \sim \mathcal{N}(\mu,\sigma^2)\), then \(\exp(Y)\) is lognormal.
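These relations can be spot-checked numerically with closed-form CDFs only (a small sketch; the values of `x`, `mu`, `sigma`, `y` are arbitrary):

```python
import numpy as np
from scipy import stats

# Chi-square relation: P(Z^2 <= x) = P(-sqrt(x) <= Z <= sqrt(x)) matches chi2(df=1).
x = 2.5
p_via_normal = stats.norm.cdf(np.sqrt(x)) - stats.norm.cdf(-np.sqrt(x))
p_via_chi2 = stats.chi2.cdf(x, df=1)

# Lognormal relation: P(exp(Y) <= y) = P(Y <= log y) for Y ~ N(mu, sigma^2);
# SciPy's lognorm uses s=sigma, scale=exp(mu).
mu, sigma, y = 0.4, 0.9, 1.7
p_via_norm_log = stats.norm.cdf(np.log(y), loc=mu, scale=sigma)
p_via_lognorm = stats.lognorm.cdf(y, s=sigma, scale=np.exp(mu))

print(p_via_normal, p_via_chi2)       # should agree
print(p_via_norm_log, p_via_lognorm)  # should agree
```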

3) Formal definition#

Let \(X \sim \mathcal{N}(\mu, \sigma^2)\) with \(\mu\in\mathbb{R}\) and \(\sigma>0\).

PDF#

\[ f(x\mid\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),\qquad x\in\mathbb{R}. \]

For the standard normal \(Z\sim\mathcal{N}(0,1)\), the PDF is \[ \varphi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}. \]

CDF#

The CDF is \[ F(x\mid\mu,\sigma) = \mathbb{P}(X\le x) = \Phi\!\left(\frac{x-\mu}{\sigma}\right), \] where \(\Phi\) is the standard normal CDF.

There is no elementary closed form, but it can be written using the error function: \[ \Phi(z) = \tfrac{1}{2}\left(1 + \operatorname{erf}\!\left(\tfrac{z}{\sqrt{2}}\right)\right). \]
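The error-function identity can be verified numerically against SciPy's dedicated standard-normal CDF, `scipy.special.ndtr`:

```python
import numpy as np
from scipy import special

# Phi via the erf identity vs SciPy's dedicated standard-normal CDF.
z = np.linspace(-5, 5, 11)
phi_via_erf = 0.5 * (1.0 + special.erf(z / np.sqrt(2.0)))
print(np.allclose(phi_via_erf, special.ndtr(z)))  # True
```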

4) Moments & properties#

For \(X \sim \mathcal{N}(\mu, \sigma^2)\):

Moments#

  • Mean: \(\mathbb{E}[X] = \mu\)

  • Variance: \(\mathrm{Var}(X) = \sigma^2\)

  • Skewness: \(0\) (symmetry)

  • Kurtosis: \(3\) (excess kurtosis \(0\))

  • Median / mode: \(\mu\)

MGF and characteristic function#

  • MGF (all real \(t\)): \[ M_X(t) = \mathbb{E}[e^{tX}] = \exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right). \]

  • Characteristic function: \[ \varphi_X(t) = \mathbb{E}[e^{itX}] = \exp\!\left(i\mu t - \tfrac{1}{2}\sigma^2 t^2\right). \]
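As a quick sanity check of the MGF formula (a sketch with arbitrarily chosen `mu`, `sigma`, `t`), a Monte Carlo estimate of \(\mathbb{E}[e^{tX}]\) should land near the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, t = 0.3, 1.1, 0.5

x = rng.normal(mu, sigma, size=200_000)
mgf_mc = np.exp(t * x).mean()                        # Monte Carlo E[e^{tX}]
mgf_closed = np.exp(mu * t + 0.5 * sigma**2 * t**2)  # closed form

print(mgf_mc, mgf_closed)  # should agree to roughly 2-3 decimal places
```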

Entropy (differential, in nats)#

\[ H(X) = \tfrac{1}{2}\ln\!\left(2\pi e\,\sigma^2\right). \]

Other notable properties#

  • Affine invariance: if \(Y=aX+b\), then \(Y\) is normal with mean \(a\mu+b\) and variance \(a^2\sigma^2\).

  • Additivity: sums of independent normals are normal (and covariances add in the multivariate case).

  • Maximum entropy under fixed mean/variance constraints.
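The moment and entropy formulas above can be cross-checked against SciPy (note that `stats(moments="mvsk")` reports *excess* kurtosis, so a normal gives 0):

```python
import math
from scipy import stats

mu, sigma = 1.5, 0.8
dist = stats.norm(loc=mu, scale=sigma)

mean, var, skew, kurt = (float(v) for v in dist.stats(moments="mvsk"))
entropy_closed = 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)

print(mean, var, skew, kurt)                  # μ, σ², 0, 0 (excess kurtosis)
print(float(dist.entropy()), entropy_closed)  # should match
```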

SQRT_2PI = math.sqrt(2.0 * math.pi)


def norm_pdf(x: np.ndarray, loc: float = 0.0, scale: float = 1.0) -> np.ndarray:
    x = np.asarray(x, dtype=float)
    if scale <= 0:
        raise ValueError("scale must be > 0")
    z = (x - loc) / scale
    return np.exp(-0.5 * z**2) / (scale * SQRT_2PI)


def norm_cdf(x: np.ndarray, loc: float = 0.0, scale: float = 1.0) -> np.ndarray:
    x = np.asarray(x, dtype=float)
    if scale <= 0:
        raise ValueError("scale must be > 0")
    z = (x - loc) / scale
    return special.ndtr(z)


def norm_logpdf(x: np.ndarray, loc: float = 0.0, scale: float = 1.0) -> np.ndarray:
    x = np.asarray(x, dtype=float)
    if scale <= 0:
        raise ValueError("scale must be > 0")
    z = (x - loc) / scale
    return -0.5 * z**2 - math.log(scale) - 0.5 * math.log(2.0 * math.pi)


def norm_loglik(loc: float, scale: float, x: np.ndarray) -> float:
    x = np.asarray(x, dtype=float)
    if scale <= 0 or np.any(~np.isfinite(x)):
        return -np.inf
    return float(np.sum(norm_logpdf(x, loc=loc, scale=scale)))


def norm_mle(x: np.ndarray) -> tuple[float, float]:
    """MLE for (μ, σ) under iid N(μ, σ²).

    Note: the MLE for σ uses ddof=0 (biased as an estimator of σ).
    """

    x = np.asarray(x, dtype=float)
    mu_hat = float(np.mean(x))
    sigma_hat = float(np.sqrt(np.mean((x - mu_hat) ** 2)))
    return mu_hat, sigma_hat


def sample_norm_box_muller(
    n: int,
    loc: float = 0.0,
    scale: float = 1.0,
    rng: np.random.Generator | None = None,
) -> np.ndarray:
    """NumPy-only sampling via the Box–Muller transform.

    Returns n iid samples from N(loc, scale^2).
    """

    if rng is None:
        rng = np.random.default_rng()
    if n < 0:
        raise ValueError("n must be >= 0")
    if scale <= 0:
        raise ValueError("scale must be > 0")

    m = (n + 1) // 2  # number of (Z0, Z1) pairs
    u1 = rng.random(m)
    u2 = rng.random(m)

    # Avoid log(0) when u1 is exactly 0.
    u1 = np.maximum(u1, np.nextafter(0.0, 1.0))

    r = np.sqrt(-2.0 * np.log(u1))
    theta = 2.0 * math.pi * u2

    z0 = r * np.cos(theta)
    z1 = r * np.sin(theta)

    z = np.empty(2 * m, dtype=float)
    z[0::2] = z0
    z[1::2] = z1
    z = z[:n]

    return loc + scale * z

5) Parameter interpretation#

Location \(\mu\)#

  • Shifts the distribution left/right.

  • \(\mu\) is the center of symmetry, and it equals the mean/median/mode.

Scale \(\sigma\)#

  • Controls dispersion: larger \(\sigma\) spreads mass out and lowers the peak.

  • About 68% / 95% / 99.7% of mass lies within \(\mu \pm 1\sigma\), \(\mu \pm 2\sigma\), \(\mu \pm 3\sigma\) (the “68–95–99.7 rule”).
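The rule's exact values follow directly from \(\Phi\):

```python
from scipy import stats

# P(|X - μ| <= kσ) = Φ(k) - Φ(-k), independent of μ and σ.
probs = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}
for k, p in probs.items():
    print(f"P(|X-μ| <= {k}σ) = {p:.5f}")  # ≈ 0.68269, 0.95450, 0.99730
```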

Shape changes#

All normal PDFs are bell-shaped and symmetric; changing \(\mu\) shifts the bell, changing \(\sigma\) changes its width.

x = np.linspace(-8, 8, 800)

params = [
    (0.0, 1.0),
    (0.0, 2.0),
    (1.5, 1.0),
    (-2.0, 0.6),
]

fig = go.Figure()
for mu, sigma in params:
    fig.add_trace(
        go.Scatter(
            x=x,
            y=norm_pdf(x, loc=mu, scale=sigma),
            mode="lines",
            name=f"μ={mu:g}, σ={sigma:g}",
        )
    )
    fig.add_vline(x=mu, line_dash="dot", opacity=0.25)

fig.update_layout(title="Normal PDFs for different (μ, σ)", xaxis_title="x", yaxis_title="f(x)")
fig.show()

6) Derivations#

We derive \(\mathbb{E}[X]\), \(\mathrm{Var}(X)\), and the likelihood/MLE.

Expectation#

For the standard normal \(Z\sim\mathcal{N}(0,1)\) with PDF \(\varphi(z)\), \[ \mathbb{E}[Z] = \int_{-\infty}^{\infty} z\,\varphi(z)\,dz. \] The integrand \(z\,\varphi(z)\) is an odd function (since \(\varphi\) is even), so the integral over a symmetric domain is \(0\).

For \(X = \mu + \sigma Z\): \[ \mathbb{E}[X] = \mu + \sigma\,\mathbb{E}[Z] = \mu. \]

Variance#

First compute \(\mathbb{E}[Z^2]\): \[ \mathbb{E}[Z^2] = \int_{-\infty}^{\infty} z^2\,\varphi(z)\,dz. \] Use the fact that \(\varphi'(z) = -z\,\varphi(z)\), so \(z\,\varphi(z) = -\varphi'(z)\). Then \[ \mathbb{E}[Z^2] = \int z^2\,\varphi(z)\,dz = -\int z\,\varphi'(z)\,dz. \] Integrate by parts with \(u=z\) and \(dv=\varphi'(z)\,dz\): \[ -\int z\,\varphi'(z)\,dz = -\big[z\,\varphi(z)\big]_{-\infty}^{\infty} + \int \varphi(z)\,dz. \] The boundary term is \(0\) because \(z\,\varphi(z)\to 0\) as \(|z|\to\infty\), and \(\int \varphi(z)\,dz = 1\). Hence \(\mathbb{E}[Z^2]=1\), so \(\mathrm{Var}(Z)=1\).

For \(X=\mu+\sigma Z\): \[ \mathrm{Var}(X) = \sigma^2\,\mathrm{Var}(Z) = \sigma^2. \]
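A quadrature spot-check of the two integrals used in these derivations (\(\mathbb{E}[Z]=0\) and \(\mathbb{E}[Z^2]=1\)):

```python
import math
from scipy import integrate

# Standard normal PDF, written inline.
phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# quad handles infinite limits via an internal variable transformation.
ez, _ = integrate.quad(lambda z: z * phi(z), -math.inf, math.inf)
ez2, _ = integrate.quad(lambda z: z * z * phi(z), -math.inf, math.inf)
print(ez, ez2)  # ≈ 0.0 and ≈ 1.0
```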

Likelihood and MLE#

For iid data \(x_1,\dots,x_n\) from \(\mathcal{N}(\mu,\sigma^2)\), the likelihood is \[ L(\mu,\sigma) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right). \] The log-likelihood is \[ \ell(\mu,\sigma) = -n\ln\sigma - \tfrac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2. \] Setting derivatives to zero gives the MLEs: \[ \hat\mu = \bar x,\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2. \] (The familiar unbiased sample variance uses \(n-1\) instead of \(n\).)

# MLE demo on simulated data
true_mu = 1.5
true_sigma = 0.8
n = 600

x = sample_norm_box_muller(n, loc=true_mu, scale=true_sigma, rng=rng)

mu_hat, sigma_hat = norm_mle(x)

loglik_true = norm_loglik(true_mu, true_sigma, x)
loglik_hat = norm_loglik(mu_hat, sigma_hat, x)

true_mu, true_sigma, mu_hat, sigma_hat, loglik_true, loglik_hat
(1.5,
 0.8,
 1.4563135974860988,
 0.7969433986282367,
 -716.0835277507142,
 -715.1801474751153)

7) Sampling & simulation (NumPy-only)#

Box–Muller transform#

Let \(U_1, U_2 \sim \mathrm{Uniform}(0,1)\) iid. Define \[ R = \sqrt{-2\ln U_1},\qquad \Theta = 2\pi U_2. \] Then \[ Z_0 = R\cos\Theta,\qquad Z_1 = R\sin\Theta \] are iid \(\mathcal{N}(0,1)\). Finally, to sample \(X\sim\mathcal{N}(\mu,\sigma^2)\), return \(X = \mu + \sigma Z\).

Numerical note: if \(U_1=0\), then \(\ln U_1\) is undefined, so we clip \(U_1\) away from 0.

# Sampling: compare histogram to the true PDF
mu = 0.7
sigma = 1.3
n = 60_000

samples = sample_norm_box_muller(n, loc=mu, scale=sigma, rng=rng)

x_grid = np.linspace(mu - 4.5 * sigma, mu + 4.5 * sigma, 500)

fig = px.histogram(
    samples,
    nbins=70,
    histnorm="probability density",
    title=f"Monte Carlo samples vs PDF (n={n}, μ={mu:g}, σ={sigma:g})",
    labels={"value": "x"},
)
fig.add_trace(go.Scatter(x=x_grid, y=norm_pdf(x_grid, mu, sigma), mode="lines", name="true pdf"))
fig.update_layout(yaxis_title="density")
fig.show()

samples.mean(), samples.std(ddof=0)
(0.7031752075013465, 1.2937846790607508)

8) Visualization (PDF, CDF, Monte Carlo)#

We’ll visualize:

  • the PDF for multiple \(\sigma\) values

  • the CDF and an empirical CDF from Monte Carlo samples

# PDF and CDF for multiple scales
mu = 0.0
sigmas = [0.5, 1.0, 2.0]
x = np.linspace(-8, 8, 800)

fig_pdf = go.Figure()
fig_cdf = go.Figure()

for s in sigmas:
    fig_pdf.add_trace(go.Scatter(x=x, y=norm_pdf(x, mu, s), mode="lines", name=f"σ={s:g}"))
    fig_cdf.add_trace(go.Scatter(x=x, y=norm_cdf(x, mu, s), mode="lines", name=f"σ={s:g}"))

fig_pdf.update_layout(title="Normal PDF (μ=0)", xaxis_title="x", yaxis_title="f(x)")
fig_cdf.update_layout(title="Normal CDF (μ=0)", xaxis_title="x", yaxis_title="F(x)")

fig_pdf.show()
fig_cdf.show()
# Empirical CDF vs true CDF
mu = -0.5
sigma = 1.2
n = 25_000
samples = sample_norm_box_muller(n, loc=mu, scale=sigma, rng=rng)

xs = np.sort(samples)
ys = np.arange(1, n + 1) / n

x_grid = np.linspace(mu - 4.5 * sigma, mu + 4.5 * sigma, 600)

fig = go.Figure()
fig.add_trace(go.Scatter(x=xs, y=ys, mode="lines", name="empirical CDF"))
fig.add_trace(go.Scatter(x=x_grid, y=norm_cdf(x_grid, mu, sigma), mode="lines", name="true CDF"))
fig.update_layout(
    title=f"Empirical CDF vs true CDF (n={n}, μ={mu:g}, σ={sigma:g})",
    xaxis_title="x",
    yaxis_title="F(x)",
)
fig.show()

9) SciPy integration (scipy.stats.norm)#

SciPy’s norm is parameterized as stats.norm(loc=μ, scale=σ).

Useful methods include:

  • pdf, logpdf

  • cdf, sf (survival function), and the numerically stable logcdf, logsf

  • ppf (quantiles)

  • rvs (sampling)

  • fit (MLE fitting)

mu = 0.7
sigma = 1.3
dist = stats.norm(loc=mu, scale=sigma)

x = np.linspace(mu - 3 * sigma, mu + 3 * sigma, 7)
pdf_vals = dist.pdf(x)
cdf_vals = dist.cdf(x)

# Sampling
samples = dist.rvs(size=5, random_state=rng)

# Fit (MLE)
big_sample = dist.rvs(size=5_000, random_state=rng)
mu_fit, sigma_fit = stats.norm.fit(big_sample)

x, pdf_vals, cdf_vals, samples, (mu_fit, sigma_fit)
(array([-3.2, -1.9, -0.6,  0.7,  2. ,  3.3,  4.6]),
 array([0.0034, 0.0415, 0.1861, 0.3069, 0.1861, 0.0415, 0.0034]),
 array([0.0013, 0.0228, 0.1587, 0.5   , 0.8413, 0.9772, 0.9987]),
 array([-0.3951,  1.3869,  0.9551,  0.3398, -0.1837]),
 (0.6852965688237751, 1.3020645349649365))
# Tail-stability: logcdf/logsf vs log(cdf/sf)
z = -40.0
cdf_direct = stats.norm.cdf(z)
logcdf_stable = stats.norm.logcdf(z)

z2 = 40.0
sf_direct = stats.norm.sf(z2)
logsf_stable = stats.norm.logsf(z2)

(cdf_direct, logcdf_stable), (sf_direct, logsf_stable)
((0.0, -804.6084420137539), (0.0, -804.6084420137539))

10) Statistical use cases#

Hypothesis testing (z-test for a mean, \(\sigma\) known)#

If \(X_1,\dots,X_n \sim \mathcal{N}(\mu,\sigma^2)\) with known \(\sigma\), then under \(H_0: \mu=\mu_0\), \[ Z = \frac{\bar X - \mu_0}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1). \] A two-sided p-value is \(p = 2\,\mathbb{P}(|Z|\ge |z_{\mathrm{obs}}|)\).

Bayesian modeling (Normal–Normal conjugacy for a mean, \(\sigma\) known)#

Prior: \(\mu \sim \mathcal{N}(\mu_0,\tau_0^2)\). Likelihood: \(X_i\mid\mu \sim \mathcal{N}(\mu,\sigma^2)\) with known \(\sigma\).

Posterior: \(\mu\mid x \sim \mathcal{N}(\mu_n,\tau_n^2)\) where \[ \tau_n^2 = \left(\tfrac{1}{\tau_0^2} + \tfrac{n}{\sigma^2}\right)^{-1},\qquad \mu_n = \tau_n^2\left(\tfrac{\mu_0}{\tau_0^2} + \tfrac{n\bar x}{\sigma^2}\right). \]

Generative modeling#

Normals are building blocks for generative models:

  • Linear Gaussian models (e.g., Kalman filters): Gaussian latent states + Gaussian noise

  • Gaussian mixtures (GMMs): weighted sums of normals for multi-modal densities

  • Multivariate normal: correlated features via linear transforms of independent normals

# Hypothesis test example: two-sided z-test for a mean (σ known)
mu0 = 0.0
sigma_known = 2.0
n = 40

# Simulated measurements with true mean != mu0
true_mu = 0.9
data = sample_norm_box_muller(n, loc=true_mu, scale=sigma_known, rng=rng)

xbar = data.mean()
z_obs = (xbar - mu0) / (sigma_known / math.sqrt(n))
p_two_sided = 2.0 * stats.norm.sf(abs(z_obs))

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)
ci = (
    xbar - z_crit * sigma_known / math.sqrt(n),
    xbar + z_crit * sigma_known / math.sqrt(n),
)

xbar, z_obs, p_two_sided, ci
(1.1249403105684774,
 3.5573736131335747,
 0.0003745812231249413,
 (0.5051452782639159, 1.744735342873039))
# Bayesian update for μ with known σ (Normal–Normal)
mu0 = 0.0
tau0 = 1.5  # prior std dev
sigma = sigma_known

xbar = data.mean()
tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)
mu_n = tau_n2 * (mu0 / tau0**2 + n * xbar / sigma**2)
tau_n = math.sqrt(tau_n2)

mu_n, tau_n
(1.077070510118755, 0.309426373877638)
# Visualize prior vs posterior over μ
mu_grid = np.linspace(mu_n - 5 * tau0, mu_n + 5 * tau0, 600)

prior = stats.norm(loc=mu0, scale=tau0)
post = stats.norm(loc=mu_n, scale=tau_n)

fig = go.Figure()
fig.add_trace(go.Scatter(x=mu_grid, y=prior.pdf(mu_grid), mode="lines", name="prior"))
fig.add_trace(go.Scatter(x=mu_grid, y=post.pdf(mu_grid), mode="lines", name="posterior"))
fig.update_layout(title="Bayesian update for μ (σ known)", xaxis_title="μ", yaxis_title="density")
fig.show()
# Generative modeling example: 2D correlated Gaussian via a linear transform
n = 3_000
mu_vec = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8], [0.8, 2.0]])
L = np.linalg.cholesky(Sigma)

z = sample_norm_box_muller(2 * n, loc=0.0, scale=1.0, rng=rng).reshape(n, 2)
x = mu_vec + z @ L.T

df = {"x1": x[:, 0], "x2": x[:, 1]}
fig = px.scatter(df, x="x1", y="x2", opacity=0.35, title="Samples from a correlated 2D Gaussian")
fig.update_layout(xaxis_title="x1", yaxis_title="x2")
fig.show()

x.mean(axis=0), np.cov(x.T)
(array([ 0.9797, -1.0441]),
 array([[1.0095, 0.8152],
        [0.8152, 1.9817]]))

11) Pitfalls#

  • Invalid parameters: \(\sigma\le 0\) is not allowed. In code, guard against non-positive scale.

  • Overconfidence in normality: real data may be skewed, heavy-tailed, or multi-modal. Diagnose with histograms/QQ-plots; consider alternatives (e.g., Student-\(t\), mixtures, robust losses).

  • Outliers: Gaussian likelihoods heavily penalize large residuals, so a few outliers can dominate fits.

  • Numerical issues in the tails: cdf/sf may underflow to 0; prefer logcdf/logsf or work in log-space.

  • Sampling edge cases: Box–Muller requires \(U_1>0\); clip u1 away from 0 to avoid log(0).
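As a small diagnostic sketch for the normality pitfall (the sample sizes and degrees of freedom here are illustrative), a Shapiro–Wilk test cleanly separates a Gaussian sample from a heavy-tailed one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x_norm = rng.normal(size=500)             # truly Gaussian
x_heavy = rng.standard_t(df=2, size=500)  # heavy-tailed

# Small p-value => evidence against normality.
_, p_norm = stats.shapiro(x_norm)
_, p_heavy = stats.shapiro(x_heavy)
print(f"Shapiro-Wilk p-values: normal={p_norm:.3g}, t(df=2)={p_heavy:.3g}")
```

For a visual check, `stats.probplot(x, dist="norm")` returns QQ-plot coordinates that can be drawn with Plotly.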

12) Summary#

  • norm is a continuous distribution on \(( -\infty,\infty )\) with parameters \(\mu\in\mathbb{R}\), \(\sigma>0\).

  • PDF: bell-shaped and symmetric; \(\mu\) shifts, \(\sigma\) spreads.

  • Key formulas: \(\mathbb{E}[X]=\mu\), \(\mathrm{Var}(X)=\sigma^2\), \(M_X(t)=\exp(\mu t + \tfrac12\sigma^2 t^2)\), \(H=\tfrac12\ln(2\pi e\sigma^2)\).

  • MLE: \(\hat\mu=\bar x\), \(\hat\sigma^2 = \tfrac1n\sum(x_i-\bar x)^2\).

  • For tails, prefer stats.norm.logcdf/logsf over taking log of cdf/sf.